[ENH] Add basic maxscore writer/reader #6825
HammadB wants to merge 10 commits into hammad/sparse_posting_block from
Conversation
Reviewer Checklist
Please leverage this checklist to ensure your code review is thorough before approving.
- Testing, Bugs, Errors, Logs, Documentation
- System Compatibility
- Quality
> **Warning:** This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.

This stack of pull requests is managed by Graphite.
This PR introduces a new sparse indexing/query path centered on … The PR also extracts reusable scoring primitives into …

*This summary was automatically generated by @propel-code-bot.*
```rust
/// Used by the suffix-rewrite optimization in `MaxScoreWriter::commit()`
/// to avoid loading blocks before the first affected offset.
pub async fn get_posting_blocks_from(
```

nit: this could be named something like `get_posting_blocks_range(start_seq, end_seq)`. The current name is a bit unclear.
```rust
if let Some(ref directory) = old_directory {
    let old_block_count = directory.num_blocks() as u32;
    let old_dir_part_count = if let Some(ref reader) = self.old_reader {
        reader.count_directory_parts(encoded_dim).await? as u32
```

Not sure how much of a perf concern this is: both `reader.count_directory_parts()` and `reader.get_directory()` do a `get_prefix` call, which could result in fetching blocks from S3. If the cache churns after the first call and before the second, the second will re-fetch the same blocks the first fetched, just for getting a count. Is it possible to combine these two into one call? Ignore this comment if it won't matter in practice.
```rust
#[derive(Clone)]
pub struct MaxScoreWriter<'me> {
    block_size: u32,
    delta: Arc<DashMap<u32, DashMap<u32, Option<f32>>>>,
```

This might also not matter much in practice, but `Arc<DashMap<u32, DashMap<u32, Option<f32>>>>` vs. `Arc<DashMap<u32, AsyncPartitionedMutex<Vec<(u32, Option<f32>)>>>>` might be friendlier to CPU caches. We don't need the inner hash map because different threads will be operating on different doc ids, so they can simply append to a vector.

> probably not a concern yet. The slow part is the commit.
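To make the suggested layout concrete, here is a minimal std-only sketch of the idea, with `Mutex` standing in for `AsyncPartitionedMutex` and `HashMap` for `DashMap` (the type and method names here are illustrative, not the PR's actual API): each dimension maps to an append-only vector, and the commit path sorts once per dimension instead of paying a hash lookup per entry.

```rust
use std::collections::HashMap;
use std::sync::Mutex;

/// Illustrative delta buffer: dimension -> append-only list of (doc_id, weight).
/// A `None` weight marks a delete, mirroring the `Option<f32>` in the PR.
struct DeltaBuffer {
    dims: Mutex<HashMap<u32, Vec<(u32, Option<f32>)>>>,
}

impl DeltaBuffer {
    fn new() -> Self {
        Self { dims: Mutex::new(HashMap::new()) }
    }

    /// Writers touching distinct doc ids just append; no per-doc hash lookup.
    fn push(&self, dim: u32, doc: u32, weight: Option<f32>) {
        self.dims.lock().unwrap().entry(dim).or_default().push((doc, weight));
    }

    /// Commit-time pass: one sort per dimension, sequential and cache-friendly.
    fn drain_sorted(&self, dim: u32) -> Vec<(u32, Option<f32>)> {
        let mut entries = self.dims.lock().unwrap().remove(&dim).unwrap_or_default();
        entries.sort_by_key(|&(doc, _)| doc);
        entries
    }
}

fn main() {
    let buf = DeltaBuffer::new();
    buf.push(7, 2, Some(1.0));
    buf.push(7, 0, Some(0.25));
    buf.push(7, 1, None); // delete doc 1 in dimension 7
    println!("{:?}", buf.drain_sorted(7));
    // prints [(0, Some(0.25)), (1, None), (2, Some(1.0))]
}
```

The trade-off is that duplicate doc ids must be deduplicated during the commit sort rather than being collapsed at insert time.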
```rust
    posting_reader: BlockfileReader<'me, u32, SparsePostingBlock>,
}

impl<'me> MaxScoreReader<'me> {
```

naming nit: would it be useful to segregate the getter (`get_*`) methods into owned vs. non-owned ones by naming them suitably?

> what are the non-owned ones?
```rust
let mut terms: Vec<TermState> = Vec::new();
for (idx, result) in cursor_results.into_iter().enumerate() {
    let Some(mut cursor) = result? else {
```

Why do we ignore the error here?
```rust
}
if mask.contains(doc) {
    let idx = (doc - window_start) as usize;
    bitmap[idx >> 6] |= 1u64 << (idx & 63);
```

A comment explaining this bitwise arithmetic would be useful for future readers.
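For reference, the idiom in the snippet packs one bit per window slot into `u64` words. A standalone sketch of the arithmetic (the helper names are mine, not the PR's):

```rust
/// Set bit `idx` in a flat u64 bitmap: `idx >> 6` picks the word
/// (64 = 2^6 slots per word) and `idx & 63` picks the bit inside it.
fn set_bit(bitmap: &mut [u64], idx: usize) {
    bitmap[idx >> 6] |= 1u64 << (idx & 63);
}

/// Check whether bit `idx` is set, using the same word/bit decomposition.
fn test_bit(bitmap: &[u64], idx: usize) -> bool {
    bitmap[idx >> 6] & (1u64 << (idx & 63)) != 0
}

fn main() {
    // A 4096-slot window needs 4096 / 64 = 64 words.
    let mut bitmap = vec![0u64; 64];
    set_bit(&mut bitmap, 70); // lands in word 1, bit 6
    assert!(test_bit(&bitmap, 70));
    assert!(!test_bit(&bitmap, 71));
    println!("word 1 = {:#x}", bitmap[1]); // prints word 1 = 0x40
}
```

`>> 6` and `& 63` are the power-of-two shortcuts for `/ 64` and `% 64`, which is why the word size must stay a power of two.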
Description of changes
This is PR #2 of a series unbundled from the `hammad/sparse-maxscore-prototype` branch. Stacked on `hammad/sparse_posting_block`, which adds `SparsePostingBlock` and its blockstore integration.

This PR adds the writer, reader, cursor, and eager-mode BlockMaxMaxScore query implementation for sparse posting lists. Subsequent PRs will add lazy I/O cursors, a 3-batch pipeline, SIMD budget pruning, and segment-level integration.
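Since the essential/non-essential term partitioning is the core idea behind MaxScore, here is a minimal sketch of that step, independent of this PR's types (the function name and signature are illustrative, not taken from the PR):

```rust
/// Given per-term score upper bounds sorted ascending and the current top-k
/// threshold, return the index of the first "essential" term. The prefix
/// [0..split) of non-essential terms has a combined upper bound that cannot
/// beat the threshold on its own, so documents appearing only in those
/// posting lists can be skipped entirely.
fn maxscore_split(upper_bounds: &[f32], threshold: f32) -> usize {
    let mut prefix_sum = 0.0f32;
    for (i, &ub) in upper_bounds.iter().enumerate() {
        prefix_sum += ub;
        if prefix_sum > threshold {
            return i; // terms [i..] must be iterated as candidate generators
        }
    }
    upper_bounds.len() // no document can exceed the threshold at all
}

fn main() {
    // Three query terms with upper bounds 0.5, 1.0, 2.0; current threshold 1.2.
    // 0.5 <= 1.2, but 0.5 + 1.0 > 1.2, so only term 0 is non-essential.
    let split = maxscore_split(&[0.5, 1.0, 2.0], 1.2);
    println!("{}", split); // prints 1
}
```

The block-max variant in this PR refines this per window: block-level maxima replace the global upper bounds, so the split point can move as the threshold rises during the query.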
- `BlockSparseWriter`: Accumulates per-dimension deltas in a `DashMap`, merges with an optional previous `BlockSparseReader` on commit, re-chunks entries into fixed-size blocks (default 1024), and writes both data blocks and a directory block per dimension. Supports incremental add/delete.
- `BlockSparseReader`: Reads posting blocks per dimension via the blockfile prefix API. Provides `open_cursor()` for single-dimension iteration and a full `query()` method implementing the windowed BlockMaxMaxScore algorithm.
- `PostingCursor`: Eager cursor backed by fully decompressed `SparsePostingBlock`s. Supports `advance()` with roaring bitmap masks, `drain_essential()` for Phase 1 window accumulation, `score_candidates()` for Phase 2 non-essential merge-join, and `window_upper_bound()` for per-window essential/non-essential repartitioning.
- `query()` (BlockMaxMaxScore): Windowed (4096-slot) block-max MaxScore implementation with essential/non-essential term partitioning, bitmap-tracked flat accumulator, budget-based candidate pruning, and min-heap top-k extraction.
- `SparseRescorer` trait and `rescore_and_select()` for oversampled retrieval with exact-score refinement. Not currently used.
- Design doc (`maxscore.md`): Documents the algorithm, data layout, and per-phase query walkthrough. High level, for posterity.

Test plan
11 integration test files covering the writer, reader, cursor, and query engine:
- `ms_01_blockfile_roundtrip`: Serialize/deserialize `SparsePostingBlock` through blockfile write→flush→read.
- `ms_02_writer_basic`: Write sparse vectors, commit, read back via reader.
- `ms_03_writer_incremental`: Incremental updates: add, delete, overwrite across commits.
- `ms_04_writer_edge_cases`: Single-entry dimensions, high-cardinality dimensions, empty deltas.
- `ms_05_cursor`: `PostingCursor` advance, drain_essential, score_candidates, window_upper_bound.
- `ms_06_correctness`: End-to-end `query()` results match brute-force dot-product.
- `ms_07_masks`: `query()` with Include/Exclude roaring bitmap masks.
- `ms_08_edge_cases`: Empty index, k=0, missing dimensions, single-doc queries.
- `ms_09_recall`: Recall@k measurement against brute-force on randomized data.
- `ms_10_incremental_query`: Query correctness after incremental writer updates.
- `ms_11_vectorized_scoring`: Verifies `drain_essential` and `score_candidates` accumulator arithmetic.

Tests pass locally with `cargo test`.

Migration plan
No migrations needed. This is a new index type with no existing on-disk data. Segment-level wiring is deferred to a later PR.
Observability plan
No new instrumentation in this PR. Tracing spans and metrics (block skip rate, essential/non-essential term counts, per-window candidate counts) will be added alongside segment integration in a follow-up PR.
Documentation Changes
No user-facing API changes. Internal design is documented in `rust/index/src/sparse/maxscore.md`.